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Abstract 



The multi-species coalescent provides an elegant theoretical frame- 
work for estimating species trees and species demographics from ge- 
netic markers. Practical applications of the multi-species coalescent 
model are, however, limited by the need to integrate or sample over 
all gene trees possible for each genetic marker. Here we describe a 
polynomial-time algorithm that computes the likelihood of a species 
tree directly from the markers under a finite-sites model of mutation, 
effectively integrating over all possible gene trees. The method ap- 
plies to independent (unlinked) biallelic markers such as well-spaced 
single nucleotide polymorphisms (SNPs), and we have implemented it 
in SNAPP, a Markov chain Monte-Carlo sampler for inferring species 
trees, divergence dates, and population sizes. We report results from 
simulation experiments and from an analysis of 1997 amplified frag- 
ment length polymorphism (AFLP) loci in 69 individuals sampled 
from six species of Ourisia (New Zealand native foxglove). 
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Introduction 



Biallelic markers such as single nucleotide polymorphisms (SNPs) and ampli- 
fied fragment length polymorphisms (AFLPs) are potentially rich sources of 
information about species radiations, species divergences and historical de- 
mographics. However, extracting this information is not always straightfor- 
ward. Patterns of genetic variation at these markers are not just a product 
of the relationships between the species; they also reflect inheritance pat- 
terns within each species. Any full-likelihood (or full-Bayesian) method for 
inferring species histories from genetic markers needs to model the random 
distribution of gene tree histories for each marker. To date, this task has 
often meant implementing massive Monte-Carlo simulation-based sampling 



of not only species trees, but also the gene trees at each locus (Rannala and 



Yangl [20031 |Wilson et al.| [20031 [Lluand Pearlj [20071 |Hey and Nielsenj [2007 



Heled and Drummond, 2010) 



In this paper, we describe an algorithm that allows us to bypass the gene 
trees and compute species tree likelihoods directly from the markers. The 
likelihood values, or posterior probabilities, computed by the algorithm are 
identical to those that would be obtained by sampling every possible gene 
tree, and every possible set of branch lengths, at each locus. The algorithm 
makes use of new formulae for lineage and allele probabilities under the coa- 



lescent (Bryant et al. , 2011), and of recently developed numerical techniques 



to evaluate these formulae. 
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Our approach makes the following assumptions regarding the data: 



(Al) Each marker is a single biallelic character (e.g. a biallelic SNP or AFLP 
banding pattern); 

(A2) The genealogies for separate markers are conditionally independent 
given the species tree. In practice this assumption applies to unlinked 
or loosely linked markers. 

This latter assumption is clearly not valid for sites in a single gene sequence. 
However, it is satisfied for SNPs that are well-spaced along the genome. If 
the independence assumption (A2) is only partially violated, the effect of 
linkage could be investigated by subsampling sets of markers with varying 
degrees of independence. 



In principle, the full-likelihood methods of Liu and Pearl (2007) and Heled 



and Drummond (2010) could be applied to data satisfying (Al) and (A2) 
by encoding each marker as a separate locus. This strategy would quickly 



become infeasible as the number of markers increased. Nielsen et al. ( 1998 ) 



demonstrated that a full-likelihood approach is tractable for data satisfy- 
ing (Al) and (A2), presenting an algorithm that uses biallelic characters 
directly to compute the likelihood of the species tree. However, their al- 
gorithm did not permit mutation, and it was computationally feasible only 



for small species trees. RoyChoudhury (2006) and RoyChoudhury et al. 



(2008) made a substantial advance on the computational problem. They 



took the approach of Nielsen et al. (1998) and placed it within a dynamic 



programming framework, thereby giving an efficient algorithm for computing 
the likelihood of a tree with an arbitrary number of species, but no muta- 
tion. Here we extend the dynamic programming structure of |Roy Choudhury 



et al. (2008) to incorporate mutation. We address the extensive algorithmic 
and mathematical challenges resulting from this deceptively minor change in 
model assumptions. 

We have implemented a new likelihood algorithm and incorporated it within a 
Bayesian Markov chain Monte-Carlo (MCMC) sampler which we call SNAPP 
(SNP and AFLP Phytogenies). SNAPP, which interfaces with the BEAST 
package [Drummond and Rambaut (2007), takes a range of biallelic data types 



as input and returns a sample of species trees with (relative) divergence 
times and population sizes. We have tested and validated the algorithm 
and the software using a range of techniques, and we report on results of 
two experiments with simulated data. The software is open source, and is 
available for download from 



http : //www . maths . otago . ac . nz/ software/ snapp 



To illustrate the application of SNAPP, we analyse AFLP loci in 69 individu- 
als sampled from six species of New Zealand Ourisia, or native foxglove. The 
New Zealand Ourisia form a relatively recent species radiation, and inference 
of branching patterns between these species has proven difficult, arguably due 



to the presence of ancestral lineage sorting (Meudt et al. , 2009 ). Our Bayesian 



analysis provides a relatively clear picture of ancestral species relations in the 
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group and, up to a scale constant, effective population sizes. 



Methods 



The multi-species coalescent 



Our models are all based on the assumption that the lineage dynamics within 
populations (or species) are well described by the conventional Wright-Fisher 
model. The distribution of the gene trees within each population is approx- 



Felsenstein 


(2004 


); 


Hein et al. 



(2005); Wakeley (2009)). This process models the number of ancestral lin- 
eages of the sample from a single population as a Markov process that goes 
backwards in time. Initially, the number of ancestral lineages equals the size 
of the sample. Going backwards in time (upwards in a branch) lineages meet 
at common ancestors, and the number of ancestral lineages decreases. 

It is customary in coalescent theory to rescale time in terms of effective 
population size, so that two lineages coalesce at rate 1. This rescaling is 
not generally possible in the multi-species coalescent since different species 
can have different effective population sizes. Instead we adopt the standard 
practice from phylogenetics and rescale time in terms of expected mutations 



as m 



Rannala and Yang (2003)). Hence the expected time to a coalescence 



for two lineages is 8/2 and the expected time to a coalescence for k lineages is 
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jpjzjy, where 9 denotes the expected divergence between any two individuals 
in the population. 

At the first coalescent event, two lineages are selected at random and com- 
bined, and we are left with k — 1 lineages. This coalescence of lineages 
continues until we reach the top of the branch, at which anywhere from 1 to 
k lineages could be present. 

The nodes in the species tree represent species divergences, or population 
splits. The individuals in each of the child populations are descendants of 
individuals in the parent population. In terms of the coalescent process, the 
lineages coming upwards from the child population become lineages at the 
base of the parent population. This process continues upwards in the species 
tree until we reach the species tree root. At this point, any remaining lineages 
coalesce according to the standard single-population coalescent model. 



See Felsenstein (2004), Degnan and Rosenberg (2009), and Heled and Drum 



mond (2010) for general introductions to the multi-species coalescent. Early 



contributions to the development of multi-species models built on the branches 



of a species tree were made by Tajima (1983), Hudson (1983), Takahata and 



Nei (1985), Nei (1987), Pamilo and Nei (1988) and Takahata (1989). 



The multi-species coalescent determines a distribution for gene trees and 
their branch lengths, conditional on a species tree. The parameters of the 
distribution are the shape of the species tree, the divergence times within 
the species tree and the population sizes along the branches of the species 
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tree (one parameter for each branch). We bundle these parameters into the 
single composite parameter S, so that the probability of a gene tree G given 
the species tree is P(G\S). We treat this quantity as a density rather than a 
discrete probability because of the continuous branch lengths of G. 

Let X denote the alignment of sequences for a locus. Conventional phyloge- 



netic models (e.g. Felsenstein (2004)) give us the probability that X evolved 
along a specified gene tree G. These models provide the distribution of states 
at the root and the mutation probabilities down the edges of the tree and 
hence determine P(X\G), the probability of the data (alignment) given the 
gene tree. Note that once the gene tree is chosen, the species tree has no 
further influence on the probability of the data. 

Putting P(G\S) and P(X\G) together, we obtain the joint probability (or 
density) of the alignment X and the gene tree G: 



P{X, G\S) = P(X\G)P(G\S). (1) 

The gene tree G is not observed directly and it can be difficult to estimate. 
Since our focus is on the species tree and the features of the species tree, we 
work with the marginal probability of the data, summing over all possible 
gene tree topologies and integrating over the branch lengths: 

P(X\S) = [ P{X\G)P{G\S)dG. (2) 
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Eq. ([2]) is sometimes called the Felsenstein equation ( Felsenstein 1988; Hey 



and Nielsen, 2007). 



Generally, we consider multiple genetic markers. We assume that the gene 
trees for each marker are independent. Let Xi be the alignment for the ith 
gene and let Gi be a corresponding gene tree. Under the assumption that the 
alignments are indeed independent, the total probability of the m alignments 
at m genes is a product over all of the genes: 

m m „ 

P(X 1 ,X 2 ,...,X m \S) = l[P(X i \S) = l[ / P(X t |G,)P(G J |S)rfG J . (3) 

i=i i=i ^ 

If we were to plug this formula into a Bayesian analysis, we would specify 
a prior distribution P(S) on the species trees, and then sample from the 
posterior distribution 

P(S\X u ...,X m ) ^{flj P{X l \G l )P{G l \S)dG^i P{S). (4) 

Sampling from P(S\X%, . . . , X m ) is equivalent to sampling from the joint 
posterior distribution 

P{S, G 1 ,...,G m \X 1 ,...,X m )oc (j[ PiXilG^PiGilS)^ P{S) (5) 

and only considering the marginal distribution of the species trees S. This 



is the approach taken by BATWING (Wilson et al. , 2003), BEST (Liu and 
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Pearll [20071) and STAR-BEAST QHeled and Drummondl poTol . Note that if 



the actual gene trees Gi are provided, or if they can be inferred with high 
accuracy, they can be treated as data and the species tree can be inferred 



directly (Degnan and Salter, 2005 Kubatko et al. ( 2009). 



At this point it is appropriate to reflect on what exactly is required when 
applying Q or Q to large numbers of independent biallelic markers. To 
evaluate the likelihood exactly, we would need to sum (or integrate) over all 
possible gene trees. In a Bayesian setting, we would need to sample over 
a space containing not only every possible choice of species tree, but also 
every possible choice of gene tree for every locus. Furthermore, the marginal 
probabilities for the gene trees depend not only on the data but on the species 
tree, and so the analyses for the separate genes are all interdependent. An 
analysis of 1000 independent loci then amounts to 1001 inter-linked Bayesian 
analyses (1000 gene trees and one species tree). Even with modern Monte- 
Carlo algorithms, this type of analysis is computationally daunting. 



Overview of the likelihood algorithm 



We circumvent the computational difficulties by calculating the integral in 
^ analytically. In the following sections we describe a pruning algorithm 
that we use to compute the likelihood of a species tree given genotype data 
at unlinked biallelic markers. The algorithm works in a similar manner to 



Felsenstein's pruning algorithm ( Felsenstein 1981) for computing the likeli- 



hood of a gene tree: we define partial likelihoods that focus only on a specific 
subtree; the partial likelihoods are then computed starting at the leaves (of 
the species tree), working upwards to the root. 

There are two major differences. In Felsenstein's pruning algorithm, one 
partial likelihood is defined for every node and every state (i.e. amino acid 
or nucleotide). In our algorithm we have separate partial likelihoods for the 
top and bottom of each branch in the species tree, for every possible number 
of ancestral lineages at each point, and for every possible count of the number 
among these lineages carrying each allele. 

Secondly, we need to deal with the complication that the coalescent process 
works backwards in time (and is not reversible) while the mutation process 
works forward in time. We were not able to define a simple transition process 
taking numbers of ancestral lineages to numbers of descendant lineages. In- 
stead we first compute probability distributions for the numbers of ancestral 
lineages at each node in the species tree. We then define partial likelihoods 
for subtrees in the species tree and derive the equations required to compute 
them efficiently. Finally, we show how to handle the probabilities at the root 
of the species tree when computing the full probability of the genotype data 
of a marker. 

We orient trees so that the ancestral nodes are at the top and time travels 
downwards. Thus, the base of a branch in the species tree corresponds to 
the population at the time nearest to the present, while the top of a branch 
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corresponds to the population just after it has diverged from its ancestral 
population. In a similar fashion, the genotypic state in a gene tree will 
evolve from the top of the gene tree (the common ancestor) downwards to 
the leaves. 

Red and green alleles 

The multi-species coalescent model for the evolution of markers (SNPs, AFLPs 
etc.) has two components: the model for the gene trees in the species tree, 
and the model for the markers evolving down the gene tree (that is, forward 
in time). The model for gene trees uses a coalescent process that works 
backwards in time whereas the mutation model for genetic markers (SNPs, 
AFLPs, etc.) typically works forwards in time. 

Given a gene tree with branch lengths specified, we model the evolution 
of a genetic marker using standard phylogenetic machinery. Suppose that 
there are two alleles, which we label red and green. Let u be the rate of 
mutation from the red allele to the green allele per unit time (forward in 
time), and let v be the corresponding rate of mutating from green to red. 
We say that a lineage is a red lineage if it has the red allele and a green 
lineage otherwise. 

The allele of the most recent common ancestor (MRCA) at the root of the 
gene tree is red with probability n = and green with probability (1 — 7r) = 
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-T-. The marker evolves down the gene tree as a continuous-time Markov 

u+v ° 

chain whose instantaneous rate matrix has rate u of mutating from red to 
green and rate v for mutating from green to red. The alleles at the leaves 
of the gene tree are then the observed alleles. The probability of the allele 
frequencies at a marker, given the species tree, is therefore the probability of 
the site given a gene tree multiplied by the probability of the gene tree given 
the species tree, summed over all possible gene tree topologies and integrated 
over all possible gene tree branch lengths (eq. (|3|). 



Ancestral lineage counts and the likelihood 

The multi-species coalescent can be used to generate a random gene tree 
conditional on a species tree. If we take any node or point in the species 
tree, we can count the number of lineages in the gene tree in that species 
at that point in time. We say that at a specified time point, this quantity 
is the number of ancestral lineages. The count of ancestral lineages is a 
random variable with distribution determined by the multi-species coalescent 
process and its resulting distribution of gene trees. The first step in our 
likelihood algorithm is the calculation of these lineage count distributions. 



See RoyChoudhury et al. (2008) and Efromovich and Kubatko (2008) for 



similar computations. 

Let x be a branch (i.e. ancestral species) in the species tree. Let denote 
the number of gene tree lineages at the base of the branch x. Let nj denote 
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the number of ancestral lineages at the top of the branch and let t be the 
length of the branch, measured in units of expected number of mutations 
(see Figure [l]). The minimum possible value for and nj is one, while 
the maximum possible value is the total number of individuals sampled in 
populations at or below x, which we denote by m x . The distribution of nj 
given is given by the probability in the standard coalescent model of going 
from n ancestors to k ancestors over time t (measured in units of expected 
mutations): 

Pr n x = k\n = n = > e « — — — — '— u -, (6) 

where ri[ r ] = n(n — l)(n — 2) • • • (n — r + 1) and n( r ) = n(n + 1) • • • (n + r — 1), 



(Tavare, 1984) 



When x is an external branch (adjacent to a leaf) in the species tree, 
equals the number of samples from the species corresponding to that branch. 
Let n x denote this number of samples. Then 



1 if n = n x ; 

Pr[n}? ://] J (7) 
otherwise. 



Suppose that x is an internal or external branch in the species tree. Let 
m x denote the maximum possible value for or nj, equal to the total 
number of sampled individuals summed across populations at or below x in 
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Figure 1: Gene trees in species trees. Each branch in the species trees corresponds 
to a species that is either contemporary (A,B,C) or ancestral (x,y). The present- 
day samples are represented by coloured squares along the lower edge of the tree; 
their colours denote the sampled alleles. The red (dashed) and green (solid) lines 
trace out two possible gene trees for these individuals, the red-green colouring 
indicating which allele is carried by which lineage at any particular time. The 
random variables and equal the number of lineages, and the number of 
red lineages, respectively, at the bottom of the branch for ancestral species x. The 
corresponding values at the top of this branch are denoted n J and r J , respectively. 
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the species tree. Suppose that Pr[n^ = k] has been computed for all k from 
1, 2, 3, . . . ,m x . The value of nj is determined by the value of using the 
conditional probabilities in (J6]): 

Pr [nj = n] = ^ Pr[n* = k] Pr[nJ = n\n* = k}. (8) 

k=n 

Now suppose that x is an internal branch in the species tree and that branches 
y and z are attached to the base of branch x. There is no time for coalescent 
events between the tops of the branches y and z and the bottom of the branch 
above x. Hence, n^ = riy + nj and 

n 

Pr[n^=n] = ^ Pr[n^ = k] Pr [n J = n - k]. (9) 

fc=0 

Equations ([7]), ^ and ^ together provide a method for computing Pr[n^ = 
n] for all nodes x in the species tree and all n > 1. In our implementation, 
the nodes are visited in a postorder traversal, in order from the leaves up 
to the root, so that a node is always visited after the required probabilities 
for the children have already been computed. When considering a branch 
attached to a leaf in the species tree, we use ^ to compute Pr[n^ —n) for 
all n and (|sj) to compute Pr[nJ = n] for all n. At an internal branch, we use 
^ to compute Pr[n^ = n] for all n and ^ to compute Pr[nJ = n] for all 
n. 
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Computing the partial likelihoods 

We now introduce mutation. We associate the two colours red and green 
to the two alleles and colour each branch of the gene tree according to the 
allele state along the branch. Hence a gene tree node will be red or green, 
depending on whether the corresponding lineage carries a red or green allele 
at that point. A mutation along a lineage is represented by a change in colour 
along the branch, and a branch can have multiple colour changes. 

Recall that denotes the number of gene tree lineages at the bottom of a 
particular branch x in the species tree. We let denote the number of these 
lineages that carry the red allele at that point, so that < < n^. In the 
same way, we let r J denote the number of lineages carrying the red allele at 
the top of the branch x, so that < r J < nj. 

Let r z denote the number of red alleles in a data set observed in the species 
associated with an external branch z. Our objective is to compute the joint 
probability of (r^ = r z ) over all leaves z in the species tree, conditional on 
the species tree, sample sizes, and model parameters. To this end, we define 
a partial likelihood equal to the conditional likelihood for a subtree of the 
species tree. Let 1Z X denote the event that (rjf = r 2 ) holds for every external 
branch z that is a descendant of branch x in the species tree. That is, 1Z X is 
short-hand for the event that the allele counts below x correspond to those 
observed in the data for a single genetic marker. For every node x of the 
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species tree, and every choice of n and r, we define 



F* (n, r) = Pr[7^K = r * = r] Pr K = "] 



(10) 



and 



Fj(n,r) = Pr[7^|nJ = n,rJ = r 



•] Prfn^ =n]. 



(11) 



We will see that the values F^(n, r) and Fj(ra, r) can be computed by starting 
at the leaves and working upwards towards the root, just as in Felsenstein's 
likelihood algorithm. Furthermore, when x is the root of the tree, the prob- 
ability for the entire marker can be determined from the values F^(n, r). 
Technically speaking, F^(n, r) is not a partial likelihood; rather it is the 
product of a partial likelihood (Pr[7?. x |n^ = n, = r]) and a probability 
(Pr[n^ = n}). This latter term simplifies the mathematics further on, and 
makes the computation more numerically stable. In the same way, Fj(n, r) 
is the product of a partial likelihood (Pr[7£ x |nJ = n, r J = r]) and a prob- 
ability (Pr[nJ —n]). We will show that these quantities can be computed 
using dynamic programming and then used to compute the probability of 
the marker. 

We note that the computation can be extended to multifurcating species 
trees by converting a multifurcating tree into a bifurcating tree. This is done 
by replacing any multifurcation with a series of bifurcations, all separated by 
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branches of length 0. The probabilities of the marker will be unchanged. 



Partial likelihoods for a leaf 

The simplest case for computing the partial likelihood is when the branch x is 
attached to a leaf (that is, when x is external). The number of samples from 
the associated species is n x and the number of individuals is r x . Hence 



Partial likelihoods along a branch 

Let y be a branch for which F^(n;,,r&) has already been computed, for all 
n b and r b . We carefully manipulate the conditional probabilities to obtain 
an expression for Fy(n t ,r t ). As before, we let m y denote the number of 
individuals in sampled from species at or below y. Starting with the definition 
of Fy we have 




otherwise. 



1 if n = n x and r = r. 



X 



(12) 



F^(n t ,r t ) = Pr[7^|nJ=n t ,rJ=r t ] Pr[n]; = n t ] 



m y n b 



= Yl ^ Pl [ n y\ n y= n t' r y= r t' n y= n b, r y= r b\ 



n b =nt rb=0 



x Pr [nj = n 6 ,rj = r b |nj=n t ,rj=r t ] Pr [n J 



n t \. 



(13) 
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Recall that m y denotes the total number of individuals sampled from popu- 
lations at, or below, branch y. We now use the fact that 1Z y is conditionally 
independent of and given n B and r B , so that Pr[7^|nJ, , n B , r B ] = 
Pr[ft> B r B ], and 

m y n b 

Fj(nt,r t ) = X^ Pr [^l n ? = nb ' r ? = rb ] Pr ^ = nb ' r ? = rb l n y =n *' r y =r '] Pr [ r 

ny=n t r 6 =0 

my Tlb F B (n b ,r b ) 
= p 1b!„ j Pr[< = n b ,rg = r b |n^ = n f ,r^ = r f ]Pr[n^ = n f ]. 

n h =n t r 6 =0 rr[n s/~ n6j 

After rearranging the conditional probabilities we have 

p r r n B Bi t tt p r [ r B | n B n T r T lPrfn B ln T r T l 

L 2/ ' 2/ 12/ ' V J L 2/ I 2/ ' V ' 2/ J 1- 2/ I 2/ ' 2/ J' 

which simplifies to Pr[r B |n B , n^, rj] Pr[n B |n^] as the number of red lineages 
at the top of the branch is conditionally independent of the number of lineages 
at the bottom, given the number of lineages at the top. Applying Bayes 
rule 

Prfn T l 

Prjn B |n T l — yJ = Prbi T |n B l 

l ll y \ Ll y J p r J n Bj t V I 2/ J' 

we obtain 

m y n b 

F^(n t ,r t ) = 5^F B (n 6 ,r 6 )Pr[r B =r 6 |n B =n 6 ,nJ = n t ,rJ = r t ]Pr[nJ = n t |n B = 

n h =n t r b =0 

(14) 
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The term Pr[n^ = n t \u^ = n b ] is evaluated using ^ above. Computing 
Pr[r^|n^, nj, rj] is more involved. In the special case that u = v = a. closed- 



form expression exists for this probability (Slatkin, 1996; Nielsen et al. , 1998 



RoyChoudhury et al. 2008). To our knowledge, a closed- form expression in 



the general case has not previously been derived; rather, we express this 
probability using a matrix exponential. 

Define the matrix Q with rows and columns indexed by pairs (n, r) with 



St(n,r);(n,r-1) 
Q(n,r);(n,r+1) 
Q(n,r);(ra-l,r) 
(n,r);(n— l,r— 1) 
Q(n,r);(n,r) 



(n — r + l)v 

(r + l)u 

(n — 1 — r)n 

e 

(r — l)n 


n(n — 1) 
6 



(n — r)v — ru 



< r < n 
< r < n 
0<r<n (15) 
< r < ri 
< r < n 



and all other entries zero. Here n ranges from 1 to the number of individuals 
sampled while for all n we have < r < n. Hence Q has 



m ^ 

^^(n + 1) = -m(m + 3) 



n=l 



rows and columns. We note that Q is not the generator of a process, and 
the connection with the coalescent and mutation processes is somewhat in- 
direct. The most important feature of the matrix is its role in computing the 
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conditional allele probabilities required for the partial likelihoods: 



Theorem 1 (Bryant et al, 2011, Theorem 1) Suppose that n individuals 
are sampled from a Wright-Fisher population. Let n T denote the number of 
ancestral lineages at some time t in the past, and let r T be the number of 
these that carry the red allele at that time. Then 



T) r B I B T T 1 eX Plv!''J(n,r);(?i t ,r t ) / 1c s 

Pr r = rn =n, n =n t ,r =r t \ = — y — ^ '-. (16) 

rrln 1 =n t n a = n 



See Bryant et al. (2011) for a proof of Theorem [TJ 



Partial likelihoods at a speciation 

Suppose that a branch x represents a population that diverges into two pop- 
ulations, corresponding to branches y and z. Each of the n lineages at the 
bottom of branch x came up either from the top of branch y or from the top 
of branch z. If n y is the number that came up from branch y, then n — n y is 
the number from branch z. The conditional joint distribution of and nj 
is then 



r t t r P T [ n v =n y\ Pr[nJ=n — n y ] 
Pi[n y =n y ,n z =n-n y \n x =n} = ■ pr ^ B = ^ , (17) 
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and is computed using fl7p, pi), and (J9]). The conditional distribution of red 
allele counts is given by the hypergeometric distribution: 



( n y\ fn-n y \ 

n r T T I T T B B i \r y )\r—r y ) / 1 c 

rr[r y = r y , r z = r—r y \n y =n y , n 2 =n—n y , n x =n, r x =r\ = — "—^ — — . (1? 







The value of n y can range from n y = 1 (one lineage coming from branch 
y) to n y = n — 1 (all but one lineage coming from branch y. Combining 



(17) and (18), and summing over n y and r y , and applying (10) and (11) we 



obtain 



(n-l) r 

EE' 



// \ n yi r y)F z ( n Tiy, r r y 



fn y \ rn-n v \ 
\r y /\ r—r y ) 



(19) 



This equation gives the joint probability of the allele counts in the subtree 
below branch y and the subtree below branch z, conditioned on the sum of the 
respective lineage counts equalling n and the sum of the respective red lineage 
counts equalling r. Note however that the way we have defined (n, r), and 
Fj(n, r) means these quantities are not partial likelihoods but include also 



the lineage count probabilities at the nodes (see (10) and (11)). 
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Root probabilities 



Let p denote the root of the species tree. The probability of the observed data 
at a genetic marker, conditional on the species tree and model parameters, 



is 



Pr[n p ] = Yl Pr [^pl n ? = n > r P = r ] Pr t n P = n l Pr [ r P = r l n p = n \ 

n=l r=0 
m p n 

= EE F >' r ) pr [ r P B=r i n P=^ (2°) 



n=l r=0 



Here, m p is the total number of individuals sampled. The term Pr[r^ = 
r|n^ = n] requires some assumptions about what happens in the population 
above the root. 

Using diffusion models, it can be shown that the allele frequencies in a single 



population have an approximately beta distribution (see, e.g. Ewens (2004) 



pg 174), and this is the distribution used for the root allele probabilities in 



RoyChoudhury et al. (2008). It was shown in Bryant et al. (2011) that exact 



probabilities under the coalescent model can be derived from Theorem [T] 
above. 

Let N and R be the number of lineages and red lineages sampled from a 



single population of constant size. Let Q be the matrix defined in ( 15 ) and 
let x be be the non-zero solution for Qx = such that X( 10 ) + X(i,i) = 1- 



The vector x is indexed by pairs in the same way as Q. Theorem 2 of Bryant 
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et al. (2011) states then gives Pr[R = r|N 



■ n 



k "( n ! ? ") 



for all n and and r. 



The matrix Q is highly structured and x can be computed using a simple 
recurrence in 0(m 2 ) time, where m is the maximum value for n. 
Algorithm snappLikelihood 

Computes the log-likelihood for biallelic data at a genetic marker 

for each branch x of the species tree in a post-order traversal 

compute Pr[n^ = n] for all n, using ^ if x is external and ^ otherwise, 
if not at the root, compute Pr[nJ = n] for all n using ([8]). 

end(for) 

compute Pr[R p = r|N p = n] for all n,r using Theorem 2 of Bryant et al. (2011) 



for each marker i 

for each branch x of the species tree in a post- order traversal 



compute F^(n, r) for all n, r, using (12) if x is external and (19) otherwise, 
if not at the root, compute Fj(n,r) for all n,r using (14). 
end(for) 



compute Li = Pr[7^. p ] using (20). 
end(for) 

return X!i lo g(-^)- 



Figure 2: High-level outline of the algorithm to compute the log-likelihood 
of a set of unlinked biallelic markers, given the species tree. A branch x in 
the species tree is external if it is adjacent to a leaf, otherwise it is internal. 
In Q and ^ we use y and z to denote the branches attached to the base 
of branch x. 



Time complexity 

Fig. [2] gives a high-level description of the algorithm for computing the like- 
lihood of a species tree given a collection of unlinked biallelic markers. The 
time complexity of the algorithm is dominated by two calculations. The first 



is the evaluation of F^(n, r) for all n, r using (19). A direct implementation 
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of the formula would require 0(n 4 ) time per marker, per branch in the species 
tree, where n is the number of individuals sampled. However the applica- 
tion of two-dimensional convolution algorithms reduces this complexity to 



0(n log n) by using the fast Fourier transform; see Bracewell (2000) 



The second time-consuming calculation is the computation of exp 



which 



is required for the application of (14). We found that standard diagonalisa- 



tion techniques were both computationally expensive and numerically unsta- 



ble. Instead, we use the fact that after rearranging (14) we only need to be 



able to evaluate exp(Qt)v for different vectors v. For this computation we 



implemented a Caratheodory-Fejer approximation based on Schmelzer and 
Trefethen (2007) that runs in 0(n 2 ) time per species tree node. To check nu- 



merical accuracy, we also implemented the expokit algorithm of Sidje (1998), 



which is slower than the method of Schmelzer and Trefethen (2007) but has 
more numerical safety checks. 

In summary, the time complexity of our likelihood calculation, per marker, 
is 0(sn 2 log n), where n is the number of individuals and s is the number 
of species. We implemented a dynamic cache-based system to store partial 
likelihood values for different subtrees, and multi-threading to take advantage 
of parallel computation on multiple core machines or graphics processing 
units. 
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The SNAPP sampler 



We implemented our likelihood algorithm as the core of a Markov chain 
Monte Carlo (MCMC) software package SNAPP, which takes biallelic data 
(e.g. SNPs or AFLP) at multiple loci in a set of species and returns samples 
from the joint posterior distribution of 

1. species phylogenies; 

2. (relative) species divergence times; 

3. (relative) effective population sizes along each branch. 

Note that our method does not sample gene trees; it only samples the species 
tree and its parameters. 

The software is open source, and is available for download from 



http : //www . maths . otago . ac . nz/ software/ snapp 



We have implemented a range of standard priors in the SNAPP package. 

1. The stationary allele proportions 7r and 7Ti were fixed at the observed 
proportions of red and green alleles in the data. In our experience, 
the posterior distribution of ir or -k\ is tightly peaked at the observed 
value. These observed proportions also determine mutation rates u and 
v, since we measure time in units of expected mutations. 



2. As in Drummond and Rambaut (2007), we assumed a pure birth (Yule) 



model for the species tree topology and species divergence times, with 
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a hyper-parameter A equal to the birth rate of the species tree. This 
hyper-parameter was either fixed or allowed to vary with an (improper) 
uniform hyper-prior. 



3. Following Rannala and Yang (2003 ), we used independent gamma prior 



distributions for the population size parameters 9. We used fixed val- 
ues for the shape a and inverse scale (3 parameters for the gamma 
distribution. 

Later, we list the parameters for the prior distributions that we used for 
simulations and for the analysis of the Ourisia data. 

The MCMC proposal functions implemented in SNAPP are standard, and are 



a subset of those available BEAST (Drummond and Rambaut, 2007), when 



sampling from molecular clock trees. Briefly, we implemented moves that 
raise or lower single nodes in the species tree, a move that swaps subtrees in 
the species tree, moves that alter 9 values for single or multiple populations, 
and several moves that alter branch lengths and 9 values simultaneously. 



See Drummond et al. (2002) for a detailed discussion of these moves, and 
Huelsenbeck et al. (2001 ) for a general overview of the use of MCMC methods 
to sample phylogenies. The selection of proposal moves is expected to change 
frequently as the MCMC algorithm is improved, see the (online) SNAPP 
manual for details. 

The execution of the MCMC, and outputs, are controlled via a BEAST- 
style XML-file, which can be constructed using a graphical user interface. 
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The user has the option of outputting a range of parameters and statistics 
based on the chain; many more are available by passing the output tree 
files through TreeStatQ Convergence can be assessed for several statistics 
(e.g. likelihood values, tree length, tree height, summary 9 values) visually 



using Tracer (Rambaut and Drummond, 2007) on multiple chains, and using 



the Gelman-Rubin diagnostic (Gelman and Rubin, 1992). Credibility sets 



are determined by ranking the trees in the sample by decreasing sample 
frequency, and keeping as many trees as were necessary to obtain a total of 
95% of the sample (after burn in). 

Note that we report only relative and not absolute estimates of 9 values. The 
reason is that once non-polymorphic sites are removed, the allele frequencies 
of the polymorphic sites appear to contain little or no information about 9. 
Under the infinite-sites model, the expressions giving allele count probabili- 
ties for for a polymorphic site do not involve 9. Hence, if we condition on sites 
or markers being polymorphic, the frequency data provides no information 
about the absolute value of 9. In fact the proportion of segregating sites is a 



sufficient statistic for 9 in a single population (RoyChoudhury and Wakeley 



2010) under the infinite-sites model. 



The situation might be different for the multi-species coalescent, but the 

following heuristic argument suggests it is not. One interpretation of the 

infinite-sites model for polymorphic sites is that we are conditioning on there 
1 htt p : // tree . bio . ed . ac .uk/ software / treest at / 
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being exactly one mutation in the gene tree. It follows that the allele prob- 
abilities for a polymorphic site are invariant to rescaling the length of all 
branches in the gene tree by a constant. If we rescale the divergence times 
and 9 values by a constant, the effect on the gene tree distribution will be 
the same as simply rescaling all the gene trees by the same constant. As 
this can have no effect on the marker probabilities, the likelihood function is 
flat. It is not clear the extent to which the same applies for the finite-sites 
model, however we would expect similar behaviour between the two when 
the divergence rates are low. 

Simulations 

We have tested the likelihood algorithm and the SNAPP software extensively. 
The likelihood algorithm and sampler were originally implemented in C++ 
and then subsequently re-implemented in Java. Core calculations, such as 
those required to apply Theorem 1, were independently re-implemented and 
tested in MATLAB. The likelihood values returned by the algorithm for two 
species and a small number of individuals were identical to those obtained 
analytically from gene tree probabilities. 

We have implemented a simulator called SimSnapp which generates bial- 
lelic polymorphic markers on a species tree according to the multi-species 
coalescent. Internally, SimSnapp generates a gene tree within the species 
tree and then evolves the marker along that gene tree. If the marker is 
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not polymorphic, then both gene tree and marker are discarded. We note 



that MCMC-COAL (Rannala and Yang, 2003) and Mesquite (Maddison and 



Maddison, 2010) could also have been used to generate gene trees, though 
the fact that most gene trees are discarded made it more efficient to combine 
gene tree simulation and character simulation within a single program. 
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Figure 3: Species trees used for simulations. These trees are identical to those 



used by Liu and Pearl (2007) to assess the reconstruction of species trees from 
known gene trees. 6 values are indicated on the tree, and branch lengths are 
drawn to scale with respect to time. All external branches have 9 = 0.006, and 
branch lengths are drawn to the same scale. (A) A 'hard' four-taxa tree, the 
difficulty stemming from the short branch separating A, B and C from D. (B) An 
'easy' four-taxa tree. (C) A 'hard' eight-taxa tree, made difficult by the two short 
branches. (D) An 'easy' eight-taxa tree. 



In our simulations, we began with the two four-species trees and two eight- 



species trees used in simulation experiments of Liu and Pearl (2007) Fig. |3j 
Two of the trees were classified as 'hard' due to a short internal branch that 
would be difficult to resolve: we would expect to need more individuals or 
more markers to accurately infer these trees. We used the same 9 values as 



Liu and Pearl (2007) for all internal branches and 9 = 0.006 for the external 



branches. 

In our first experiment we tested whether, as the number of sites increased, 
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the species tree used to generate data appeared in the credibility set produced 
by SNAPP. This experiment tests the likelihood algorithm and the sampling 
algorithm simultaneously. As the number of sites increases, the likelihood 
function should concentrate around the true value (assuming identifiability) 
and, consequently, so should the posterior distribution. We also ran SNAPP 
with four choices of prior. Fo the 9 values we considered a 'correct' prior with 
expectation close to the values used in simulation, and an 'incorrect' prior 
with values averaging 10% of the true values. In the same way we considered 
a 'correct' prior for species rate corresponding roughly to tree heights of the 
simulated trees, as well as an 'incorrect' prior giving tree heights of around 
50% of the true value. 

Chains were started from random trees and initial parameter values drawn 
from the prior, with chain lengths of 200,000 for the four-taxa case and 
400,000 for the eight-taxa case, these lengths being determined by conver- 
gence tests on preliminary runs. For this experiment we sampled only one 
individual per species. 

For the second experiment, we examined the posterior distribution of diver- 
gence time and 9 values on a fixed tree, using the 'easy' four-taxon tree from 
before. We simulated ten individuals for each of the four species. We used 
the same priors as before ('correct' and 'incorrect') and generated chains of 
length one million. For each choice of prior we ran two sampling regimes. For 
the first scheme, we fixed divergence times (node heights) at their true value 
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and sampled 9 values only. For the second scheme, we sampled divergence 
times and 9 values at the same time. 



Analysis of Ourisia AFLP data 



Meudt et al. (2009) investigated the utility of amplified fragment length poly- 
morphism (AFLP) markers for species delimitation and reconstruction of evo- 
lutionary relationships between New Zealand populations of Ourisia (Plan- 
taginaceae), the native foxglove. Molecular evidence suggests that Ourisia 
species have radiated fairly recently (between 0.4 and 1.3 mya), adapting 



rapidly to a range of habitats, from sea level to alpine herbfields (Meudt 



et al. 2009). 



AFLP markers are a readily available source of whole-genome information, 
well suited to the analysis of closely related species, particularly in the ab- 
sence of whole-genome sequences (see review in |Meudt and Clarke| ( |2007| )). 



The analysis in Meudt et al. (2009) used a collection of 2555 non-constant 
amplified fragment length polymorphism (AFLP) markers for 193 Ourisia 
individuals, sampled from 100 locations in New Zealand and three locations 
in Australia. Several contrasting tree-based and cluster-based methods were 
applied, identifying 15 distinct meta-populations. They also detected strong 
evidence for a split between a large-leaved group and a small-leaved group, 
the molecular signal for the split being consistent with differences in both 
morphology and habitat. The relationship between the species within each 
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of these groups was not well resolved by any method. Meudt et al. (2009) 
argued that this lack of resolution is not due to introgression or insufficient 
diversity. Two plausible explanations given are the effect of incomplete lin- 
eage sorting and the potentially low ratio of phylogenetic signal to noise in 
AFLP data. 

Here we applied SNAPP to AFLP data from all members of the large-leaved 
group, producing a data matrix of 69 taxa and 1997 characters. We used a 
diffuse gamma prior for the 9 values (a = 10, (3 = 100), with independent 
9 values on each branch. We used a pure birth (Yule) prior for the species 
tree, with birth rate A sampled from an improper uniform hyper-prior. We 
generated one chain of 1.4 million iterations (sampling every 500 iterations), 
and two shorter chains (100,000 iterations each) to check convergence. 



Results 

Simulation experiments: recovering the species tree 

For the first experiment, we tested whether SNAPP would recover the tree 
used to generate the data. We expected the true tree would be in the 95% 
credibility set for almost all replicates, and that the size of the credibility set 
would shrink as the number of sites increased. 
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Table 1: The size of the credibility sets in the first simulation, 'tree' indicates 
which of the trees in Fig. [3] was used to generate data, 'c' and T indicate 
whether 'correct' or 'incorrect' priors were used on the 9 values and on the 
speciation rate. Numbers 100 to 1000000 indicate the number of polymorphic 
sites generated. Values in the table are the numbers of trees in the credibility 
set. The seven instances where the true tree was not contained within this 
set are marked by an asterix (*). 

Four taxa with an 'easy' tree: In all simulations, the 95% credibility set 
contained the true tree and no other trees. 

Four taxa with a 'hard 7 tree: the true tree was in the 95% credibility set for 
all except one instance. The credibility set contained three trees (the three 
resolutions of the short branch) for data sets with 100 to 1000 sites. The 
credible set shrunk to one (true) tree with 10,000 or more sites. The two 
priors had little effect on outcomes. 
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Eight taxa with an 'easy' tree: In all simulations, the 95% credibility set 
contained the true tree and no other trees. 

Eight taxa with a 'hard' tree: The true tree was in the 95% credibility set 
for all except two simulated data sets of 300 and 800 sites. Otherwise the 
true tree was contained in the credibility set. There were at least three trees 
in the credibility set even for data sets with one million sites: SNAPP was 
unable to resolve the short edge in the species tree. 

Overall, the credibility sets contained the true trees in nearly all the experi- 
ments, which is a good indication that the likelihood computation is working 
correctly. 

Simulation experiments- recovering parameters 

The marginal posterior distributions for the node height and 9 parameters 
appear in Fig. |4j Once again, the data were analysed using two different sets 
of priors. The corresponding posterior densities are represented by solid and 
dashed lines in the figure, and it is apparent that at least in this case, there 
is very little difference when using the 'correct' or 'incorrect' prior. 

As well as varying the prior, we varied the selection of variables being sam- 
pled. This can be used to explore covariances in the posterior. In one case, 
we sampled 9 values and node heights, and in the other case we fixed node 
heights at their true value and sampled only 9 values. Fixing the heights 
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greatly reduced the posterior variance of the 9 estimates and shifted them 
closer to the true value. This could well be due to the (approximate) con- 
founding of 9 and time. Simultaneously scaling the 9 values and branch 
lengths induces little change in the likelihood. The fact that the posterior 
distribution of 9 reduces so markedly when heights are fixed indicates that 
this invariance under scaling is making a substantial contribution to the vari- 
ance of the posterior. 

If we compare 9 and node height estimates for different parts of the species 
tree, we see that both sets of posterior distributions become more diffuse as 
we head away from the leaves of the species tree and towards the root. This 
is clearly indicated by the density plots of Fig. |4j which depict the density of 
the sampled 9 values minus their true values (the offset). The 9 distributions 
for the four observed populations (A,B,C,D in Fig. |1J) are comparatively 
peaked and close to the true value (zero-offset in the figure). In contrast, the 
9 distribution for the root is extremely diffuse, especially when we do not 
fix the node heights. In the same way, the estimation of the height for the 
root node for these examples appears much more difficult than for the node 
closest to the leaves. 

If we are to make reliable estimates of 9 values for an ancestral population, 
we need coalescent events to occur along the corresponding branch. When 
simulating the data, we recorded the numbers of lineages at each node in 
the species tree. There were 10 lineages sampled from each population (A), 
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(B), (C), (D). There were, on average, 2.8 lineages at the base of population 
(A,B), 2.3 lineages at the base of (A,B,C) and 2.0 lineages at the base of 
the root population. Hence, there were very few coalescent events along 
internal branches of the species tree, explaining the more diffuse posterior 
distributions for the corresponding 6 values. 



Ourisia data 

We carried out 1.5 million iterations of the SNAPP MCMC algorithm in 
approximately 80 hours of computing time on an Opteron 24 core desktop 
computer. Additional chains were run to test convergence. 

The species tree topology represented in Fig. [5] has a posterior probability of 
70%. The second most probable topology (15%) differed only by the position 



of the root. Interestingly, the AFLP phylogeny of Meudt et al. (2009), which 
was obtained using a MrBayes analysis that ignored lineage sorting, had less 
than 5% posterior probability. 

SNAPP found significant support for several relationships between species in 
this 'large- leaved' group. The (O. vulcanica, O. calycina) clade and the (0. 
macrophylla, O. crosbyi) clade both have significant posterior probability. 



The former clade also appears in the MrBayes tree of Meudt et al. (2009), 
though the latter does not. 

One feature of the tree in Fig. [5] is that the divergence times for all of the 
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clades are early relative to the age of the tree. This result is consistent with a 
rapid species radiation at the base of the large-leaved group where an initial 
swift expansion if followed by a period of consolidation. 

The 9 estimates reported in Fig. [5] are relative estimates only, and many have 
high posterior variance. One anomaly is the 9 estimate for O. macrophylla 
subsp. lactea which is at least twice that of other species. A Neighbor-Net 



(Bryant and Moulton, 2004) of the AFLP data reveals considerable substruc- 



ture, and O. macrophylla subsp. lactea is not monophyletic in neighbor- 



joining or parsimony analyses (Meudt et al. 2009). Hence, the high 9 value 



could well represent fragmentation within the subspecies or poor delimitation 
of the subspecies with respect to O. macrophylla subsp. macrophylla, rather 
than a large population. 

In summary, by taking lineage sorting into account, we have been able to 
extract a well supported phylogenetic tree for Ourisia species 'large-leaved' 
group. Earlier tree-based and cluster-based analyses were unable to extract 



such a clear signal from these data (Meudt et al. 2009). Our analysis did 



not support the same tree as a Bayesian phylogenetic analysis which ignored 
incomplete lineage sorting. Furthermore, our 9 estimates indicate potential 
fragmentation or poor delimitation in O. macrophylla subspecies lactea. 
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Figure 4: Posterior distribution for 9 and node heights, simulated data on 
a four taxon tree. 9 values are offset by subtracting off the corresponding 
'true' value on the model tree. Solid lines indicate used of 'correct' prior; 
dashed lines indicate use of 'incorrect' prior. Dark lines denote density when 
only 9 values sampled; gray lines denote density when both divergence times 
and 9 values sampled. A) The model tree used for the second simulation; B) 
posterior distribution for 9 on external branches; C) posterior distribution for 
9 on internal branches; D) posterior distribution for 9 at root; E) posterior 
distribution for node heights. 
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O. calycina 



O. macrophylla subsp. lactea 



O. macrophylla subsp. macrophylla 



O. crosbyi 



Figure 5: Species tree with the highest posterior probability (70%) for six 
'Large-leaf Ourisia species. The thickness of bars proportional to 9 values for 
the respective populations. 6 values for each population printed on the pipes. 
These are relative values only, and were scaled so that sum of 9 values for 
present day populations is fixed at 1. The posterior probabilities for internal 
nodes are printed on an angle. 
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Discussion 



We have presented a method that takes biallelic markers sampled from multi- 
ple individuals from multiple species and computes the likelihood of a species 
tree topology together with population-genetic parameters. Our approach 
implements a full multi-species coalescent model without having to explicitly 
integrate or sample the gene trees at each loci. With our MCMC sampler, 
SNAPP, we can concentrate on the parameters of interest: the species tree, 
population sizes and divergence times, rather than on the problem of travers- 
ing through the space of potential gene trees. The likelihood values we com- 
pute are exact (up to numerical error) and do not resort to a simplification 
or approximation of the full coalescent model. 



Our methods differ from those of Nielsen et al. (1998) and RoyChoudhury 



et al. (2008) by the inclusion of mutation. While mutation is rare in SNP 



data from the most closely related populations, it can play a significant role 
in the evolution of markers for more distant species, for trees with multiple 
species, or when analysing markers such as AFLP that may have higher 
mutation rates than SNPs. 



To incorporate mutation we have derived fundamental new results on the evo- 
lution of biallelic markers under the coalescent (Theorems 1 and 2). These 



formulae extend work of Tavare ( 1984 ) on the distribution of ancestral lin- 



eage counts, and of Slatkin ( 1996 ) on the evolution of markers in a population 
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without mutation. It combines the coalescent process, which operates back- 
wards in time, with the mutation process, which works forwards in time. 



The SNAPP sampler complements methods such as BEST (Liu and Pearl 



2007) and Star-Beast (Heled and Drummond, 2010), which sample gene trees 



explicitly. Each is suited to a different kind of data. BEST and Star-Beast 
can analyse tightly linked markers such as whole gene sequences, whereas 
SNAPP assumes that all markers are unlinked. It should not be applied to 
whole genes. Unlike the other methods, SNAPP can analyse tens of thou- 
sands of unlinked markers, something that would not be practical if it were 
necessary to explicitly jointly sample one gene tree for each marker in a 
Monte-Carlo algorithm. 

We have reported some of the analyses we have performed to validate the 
algorithm and our implementation, and to assess the ability of the method to 
infer phylogenetic and demographic parameters. Considerable scope exists 
for a more extensive investigation into the strengths, weaknesses and charac- 
teristics of the methodology with respect to other approaches. A wide variety 
of factors affect the performance of this method, or indeed any method, when 
inferring trees and parameters, (i) Sufficient mutation must have occurred, 
but not so much as to cause loss of signal; (ii) 9 values can only be reliably in- 
ferred for ancestral populations if there are sufficient coalescent events within 
these populations; (iii) If 6 values are too high, all coalescences will occur 
within the root population and no method will be able to infer the tree or 



42 



6 values. Alternatively if the values are too low, then all coalescences will 
occur along pendant branches (as in the examples above), and only patchy 
information will be available about phylogenetic relationships and popula- 
tion sizes closer to the root of the species tree. These issues are likely to 
be faced not only by SNAPP, but by any method inferring phylogenies and 
population sizes. 

Because of the potential lack of information about aspects of the species histo- 
ries, it is of critical importance that methods report not just point estimates, 
but measures of uncertainty, irrespective of whether they follow Bayesian 
of frequentist paradigm. Priors either need to be sufficiently diffuse, or to 
encode actual information. 

We have only briefly discussed the steps taken to improve the computational 
efficiency of the algorithm and the SNAPP software. A need exists for fur- 
ther improvements in speed, particularly as the number of individuals in 
the sample increases. One possibility might be to discard coalescent theory 



and revert to diffusion models for allele frequency change (Siren et al. 2010) 
as these remove the computational difficulties arising from large numbers 
of individuals. A hybrid approach is also possible in which computation- 
ally efficient continuous models can provide approximate likelihoods in the 
SNAPP MCMC algorithm, promising substantial increases in sampling ef- 



ficiency (Fox, 2008). An alternative, but intriguing, possibility might be to 



derive a continuous approximation to the SNAPP likelihood calculation it- 
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self, as opposed to approximating the allele frequency changes in the entire 
population. 

Currently, SNAPP enables a full coalescent analysis of hundreds of thousands 
of biallelic markers taken from dozens of individuals in multiple species. We 
see no significant limiting computational factor that, in time, would prevent 
a SNAPP analysis involving hundreds of thousands of markers and hundreds, 
instead of dozens, of individuals. 
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